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Abstract 

Identity by descent (IBD) inference is the task of computationally detecting genomic segments that are shared between 
individuals by means of common familial descent. Accurate IBD detection plays an important role in various genomic 
studies, ranging from mapping disease genes to exploring ancient population histories. The majority of recent work in the 
field has focused on improving the accuracy of inference, targeting shorter genomic segments that originate from a more 
ancient common ancestor. The accuracy of these methods, however, is achieved at the expense of high computational cost, 
resulting in a prohibitively long running time when applied to large cohorts. To enable the study of large cohorts, we 
introduce SpeeDB, a method that facilitates fast IBD detection in large unphased genotype data sets. Given a target 
individual and a database of individuals that potentially share IBD segments with the target, SpeeDB applies an efficient 
opposite-homozygous filter, which excludes chromosomal segments from the database that are highly unlikely to be IBD 
with the corresponding segments from the target individual. The remaining segments can then be evaluated by any IBD 
detection method of choice. When examining simulated individuals sharing 4 cM IBD regions, SpeeDB filtered out 99.5% of 
genomic regions from consideration while retaining 99% of the true IBD segments. Applying the SpeeDB filter prior to 
detecting IBD in simulated fourth cousins resulted in an overall running time that was 10,000x faster than inferring IBD 
without the filter and retained 99% of the true IBD segments in the output. 
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Introduction 

Identity by descent (IBD) is a fundamental concept in genetics, 
pertaining to the genetic similarities among individuals who co- 
inherited an allele from a common ancestor. While genomic 
sequence variants, such as single-nucleotide variations, insertions, 
deletions, and copy-number variations, are constandy being 
introduced with every generation, genomic IBD segments have a 
high probability of being identical due to the relatively low de- 
novo mutation rates [1,2]. We call a sequence of alleles IBD if the 
alleles have been inherited from a common ancestor without 
recombination events. Because relatively few recombination events 
occur in each generation (roughly one per chromosome), the two 
likely-related individuals co-inherit a genomic segment decreases 
rapidly with every generation, while IBD segments typically stretch 
to significant lengths. Specifically, in the case of two relatives that 
share a relative g generations ago, a genomic segment that 
originated from the common ancestors needs to be transmitted 
over 2g meioses, which corresponds to a probability of 2 l ~ 2s . 
Once transmitted, assuming recombinations follow a Poisson 

100 

process [3], the expected length of the shared segment is — — 

centimorgans (cM). 

Single-nucleotide polymorphisms (SNPs) are commonly used to 
determine whether genomic regions are IBD. Opposite homozy- 
gous loci [4] are positions where two individuals are homozygous 
for different alleles. Within co-inherited regions, two genotyped 
individuals will not exhibit opposite homozygous loci, except in 



case of genotyping errors or mutations that arose more recendy 
than the common ancestor. This simple observation can be the 
basis of an effective IBD filter, as we demonstrate in this article. 

Many biological applications are based on our ability to 
determine whether or not two individuals inherited a genomic 
region from a single ancestor [5,6] . Researchers exploring ancient 
population histories rely on the ability to trace population origin 
and admixture dynamics through the accurate detection of shared 
segments [4]. Studies that map disease genes rely heavily on the 
ability to find genomic regions shared by cases. For example, in 
association studies, one must control for spontaneous sharing, 
while enriched sharing near a particular locus can be indicative of 
a disease association gene [7]. 

Extensive previous work has focused on IBD inference, aiming 
at increasingly longer time-scales, ranging from recent familial 
relatedness, up to several tens of generation to the most recent 
common ancestor. 

PLINK [6] analyzes pairs of individuals using a simple three- 
state model HMM. The states in the model correspond to the 
amount of shared IBD per position given the observed genotypes 
of two individuals. BEAGLE [8] uses a factorial HMM to phase 
the individuals' genotypes and then determines shared haplotypes 
between individuals. Unlike Plink, the BEAGLE model incorpo- 
rates complex linkage-disequilibrium (LD) patterns by constructing 
a compact set of states and constrained transition probability 
matrix that are equivalent to a variable-length Markov model. 
More explicit modeling of the inheritance vector, capturing the 
relationship between individuals, is described in earlier work that 
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Figure 1. Overview of SpeeDB and its downstream application. 

(A) SpeeDB identifies a set of candidate genomic segments (blue) of 
individuals in a database and the corresponding genomic segments of a 
query individual (red) that may be IBD. (B) Candidate segments from the 
database and the corresponding regions in the query will be sent to an 
accurate IBD detection tool (such as PARENTE [12]) for inference. Thus, 
SpeeDB greatly reduces the time necessary to infer IBD on large data 
sets by reducing the total amount of data on which inference must be 
applied. 

doi:10.1371/journal.pone.0092713.g001 

incorporated LD via a first-order Markov model at the level of the 
founders [5] . This explicit modeling of both relationship and LD is 
shown to significantly improve performance. In the work by 
Moltke et al., a Markov Chain Monte Carlo (MCMC) approach 
for IBD inference is presented whereby segments of chromosomes 
are iteratively partitioned into sets of identical descent [9]. The 
above methods present a tradeoff between accuracy and running 
time, where more complex methods require longer running time 
for the analysis. More importantly, the complexity of the analysis 
in all these methods is quadratic in the number of individuals, as 
every pair of individuals is evaluated. GERMLINE [10] reduces 
the time complexity of IBD inference at the cost of lower accuracy. 
The key idea is to perform the analysis on phased data; by 
employing hash tables with segments taken from the phased data, 
the method efficiently identifies seeds of segments that potentially 
represent IBD segments shared between individuals. Henn et al. 
[4] infer IBD from unphased genotype data by comparing SNPs of 
two individuals and identifying opposite homozygous SNPs. A 
region is inferred as a half IBD if (i) two individuals do not have 
opposite homozygous SNPs, (ii) the region is no shorter than 5 cM, 
and (iii) at least one of these two individuals has more than 400 
homozygous genotyped SNPs in this region. The work addresses 
genotyping errors by heuristically allowing one opposite homozy- 
gous SNPs assuming a region is at least 3 cM and it has at least 
300 SNPs. While the method is highly efficient, in practice, it can 
be applied for cases where the expected IBD segments are longer 
than 7 cM, or otherwise the method exhibits low sensitivity. IBD- 
Groupon [1 1] constructs an HMM model to detect IBD segments 
shared by multiple individuals. Although this method is able to 
detect group-wise IBD segments among hundreds of individuals 



sampled over thousands of SNPs, it relies on pairwise IBD 
inference using BEAGLE. PARENTE [12] detects related pairs of 
individuals accurately and rapidly from unphased genotyping data 
using an embedded likelihood ratio test. Similar to many other 
previous work, this method examines every pair of individuals. 

As studies grow to include large cohorts and meta-analyses 
include many diverse data sets, the accuracy of IBD inference 
methods becomes extremely important. However, the running 
time of these accurate IBD inference methods makes them 
computationally infeasible to apply on very large data sets. Here, 
we describe an efficient and accurate filtering method for IBD 
detection which we named SpeeDB. Given a query unphased 
genotyped individual and a database of unphased genotyped 
individuals, SpeeDB rapidly screens out genomic regions in the 
database that are unlikely to be IBD with the query, and returns 
the remainder of genomic regions. These regions can be passed 
onto traditional IBD inference methods such as PARENTE [12], 
fastlBD [13], or GERMLINE. Our method is designed so as to 
readily translate into an indexing scheme that can be used for 
searching for IBD segments within a large database of individuals. 
Along with the developed fast query approach, our method thus 
provides a practical infrastructure for large-scale IBD detection. 

IBD inference methods developed so far have focused on 
applications where the analysis is performed once on a collected 
cohort. With the exponential increase in GWAS studies, meta- 
analysis further gained momentum, merging samples from several 
studies in an attempt to leverage the large data sets to increase the 
statistical power. We developed SpeeDB, in part, to streamline 
such analyses and provide a continuous solution to the problem of 
IBD inference. We chose to architect our method so as to allow the 
continuous addition of samples to an on-going study; newly 
sampled individuals are queried against a growing database of 
samples for shared ancestry. 

Methods 

IBD inference is commonly applied on a set of individuals. 
Toward the goal of identifying IBD between all pairs of individuals 
in a cohort, we focus on a first step of identifying potential IBD 
between a single individual and members of a cohort (see Fig. 1). 
For the description of our method, we borrow the terms query and 
index from the database nomenclature to refer to the given 
individual (the query individual), and the other members of the 
cohort (the indexed individuals). 

We first assume that individuals are measured with no 
genotyping errors at the sampled markers. The above assumption 
will be relaxed in the next section. Our method relies on the 
observation that given no genotyping error, two related individuals 
should have at least one allele in common, resulting in no opposite 
homozygous SNP sites along the co-inherited segment. Converse- 
ly, when opposite homozygous SNPs are observed, under the 
assumption of no genotyping errors, two individual cannot be 
related in the corresponding site. Note that, we do not account for 
the rare event of de novo mutation for ease of discussion. It follows 
that given the set of homozygous minor allele sites in the query 
individual, indexed individuals can be scanned along these 
markers; indexed individuals exhibiting homozygous major allele 
calls at any of these positions can be discarded from further 
consideration around that particular marker location. 

Our method consists of two steps: indexing and query 
processing. First, during the indexing step, individuals' genotypes 
are scanned and transferred so as to allow efficient query. In this 
step, for every marker ieM of the m=\M\ biallelic markers, we 
generate a set of indexed individuals h' 0 ', in which every element is 
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Figure 2. Sliding window along a chromosome. When processing a query genome, we consider a window with size />af w indow cM in each step, 
where pac w i ndow is less than the target IBD block length L. The indexed individuals that do not exhibit opposite homozygous SNPs with respect to the 
query in this window define the set of candidate individuals that may be IBD with the query around this window. In the subsequent iteration, the 
window moves along the chromosome by par stcp . 
doi:10.1371/journal.pone.0092713.g002 



an indexed individual that is homozygous in the major allele. 
Given a database with n individuals sampled over m biallelic 
markers, this step takes 0(nm) time. The computation is 
performed once, and the resulting sets are stored on disk. During 
the query phase, given a query individual, we use a sliding window 
w along the corresponding query genome. The set of markers 
located within a window W are denoted by M w c M. The sliding 
window process is parameterized by the window size i?ar W mdow 
and step size par slep . As we slide the window along the genome, we 
narrow our search to specific markers where the query individual 
has homozygous minor alleles, denoted as M*'® cz M w . For every 
such position, we examine all selected markers within the current 
window; we identify all individuals that do not appear in any h| 0 ' 
sets, for any ieM ' where the query individual q is homozygous 
minor within the region indicated by the window W. These 
individuals potentially share IBD segments with the query 
individual, within regions that harbor the examined window. As 
such, the size of the window par wm d ow constitutes a lower bound 
for the potential IBD segment length. As one end of the IBD 
segment can potentially lie within [a —par step , a] , and the other lies 
within [a+par wmdow , a+par^ado^+par^p], we report 
[a— /»ar step , a+/>flf w i n( i ow +/>flr s te P ] as the candidate segment. 
Intersecting candidate segments produced by subsequent windows 
analysis are merged, producing the final reported set of candidate 
regions. When examining a window, the set h' 0 ' is visited exacdy 



Table 1. The probability of the observed genotypes G° given 
the true underlying genotype G'-. 





True genotype G[ 


Pr{G/} 


Observed genotype Gf 


Pr{Gf|G/} 


2 
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(1-9) 2 


1 29(1-9) 
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9(1-9) 


i e 1 +(i-9) 2 
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9(1-9) 
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{I -Pi) 2 


2 
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1 29(1-9) 






0 


(1-9) 2 



Given the true genotype at a marker GJ, the conditional probability of 
observing an genotype G" can be written as a function of the error rate 9. 
doi:l 0.1 371 /journal.pone.009271 3.t001 



once for all i where the query individual is homozygous in the 
minor allele, hence the expected computation of a query 
processing is 5Z, e M?' (2) \ i I' P< ^ e tne mmor allele frequency 
(MAF) at marker (. Given the expected set size of h- can be 
estimated as (1 — pi) x n and that the probability of the query 
individual exhibiting homozygous minor at marker i (where ieM w ] 
is p], the expected number of operations is given by 
zCieM,,.-/ 7 ? x (1 — Pi) x n - Once a window has been examined, 
the analysis advances to the next examined window W ', by par s t ep 
cM. Rather than performing all of the above calculations from 
scratch for each window, we reuse many of the calculations in the 
previous window when we slide to the next one by eliminating the 
impact of M^ 2 \M q J 2) and add the impact of M q J 2) \M?/ 2) . This 
process is repeated for the entire genome. The process is shown 
Fig. 2, illustrating how the IBD segment, marked in red, is targeted 
for detection. 

To further increase the efficiency by which we prune the 
candidate list, a second filter is applied focusing on positions where 
the query individual is homozygous in the major allele can be 
derived. Using this second filter, one can eliminate segments 
corresponding to indexed individual that is homozygous in the 
minor allele. An indexing phase, similar to the one described 
above, is performed for this second filter prior to query. A second 
set hj 2 ' is computed for every marker i, containing indexed 
individuals that are homozygous in the minor allele at this marker. 
We refer to the first and second described filters as the Major Filter 
and the Minor Filter throughout the text, respectively. The output 



Table 2. Running time. 





Time (sec) 


SpeeDB 


0.1 


SpeeDB+PARENTE 


2.2 


PARENTE 


427.0 


GERMLINE 


2,206.9 


fastlBD 


4,267.0 



SpeeDB's running time is shown in comparison to the running times of IBD 
detection methods PARENTE, GERMLINE, and fastlBD. SpeeDB+PARENTE 
denotes the total time for using SpeeDB followed by running PARENTE only on 
the candidate segments output by SpeeDB. Calculations based on analyzing 
6,400 pairs of individuals (one query vs. 6,400 indexed individuals for SpeeDB 
and SpeeDB+PARENTE). SpeeDB was run with thresholds corresponding to 99% 
sensitivity. 

doi:1 0.1 371 /journal.pone.009271 3.t002 
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Figure 3. Performance of SpeeDB for identifying 4 cM candidate IBD segments. (A) The speedup achieved as a function of sensitivity. 
Speedup is measured as 1 divided by the fraction of the database retained after applying SpeeDB. Sensitivity is measured as the fraction of simulated 
IBD segments that pass SpeeDB's filters. (B) The running time for comparing one query to the entire database. 
doi:1 0.1 371 /journal.pone.009271 3.g003 



of SpeeDB are segments that pass both filters. 

Tolerating Genotyping Errors 

In the above section, we assumed that no genotyping errors are 
present in our samples. The assumption significantly simplified 
calculations; however, the reality is that every sample will have 
many genotyping errors. Genotyping errors can introduce false- 
positive IBD candidates, as well as falsely remove true IBD 
segments from the candidates set. New false-positive segments may 
be introduced in cases where opposite homozygous calls in either 
the query or the indexed individuals have been disrupted by the 
error. Alternatively, a true candidate may be falsely removed from 
the list in cases where opposite homozygous calls were introduced 
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Figure 4. Fraction of database retained as a function of 
database size. The candidate fraction (the fraction of the database 
retained after applying SpeeDB) is observed to be nearly independent 
of database size at a given sensitivity level. 
doi:1 0.1 371 /journal.pone.009271 3.g004 



by errors. As our method's main objective is to provide a filtering 
process achieving sufficiendy high sensitivity (i.e., a method that 
does not eliminate true positives), we chose to optimize on the 
latter. Namely, our method will only account for genotyping error 
that might remove a true IBD candidate from the list. 

Let 6 be the probability of a genotyping error occurring, defined 
as the probability that an allele in the genotype call is incorrect. 
We denote the minor allele count at position i as genotype G,. 
Table 1 lists the probabilities for an observed minor allele count, 
given the true underlying allele count. 

Consider all the selected markers within a window. One can 
accurately compute the probability of the observed opposite 
homozygous SNPs between the query and an indexed individual 
while accounting for genotyping errors. If the observed opposite 
homozygous SNPs have a sufficiently high probability of being the 
result of genotyping errors given the other evidence that support 
the presence of an IBD segment, the filter should not screen the 
corresponding segment. 

However, the exact computation of this probability can be 
relatively time-consuming, especially in the case of large data sets 
and those with high SNP density. To reduce computation time, we 
approximate the probability that the query has u false opposite 
homozygous SNPs within the examined window with respect to 
any indexed individual, denoted by f{u). Given this approximated 
probability, we find the minimum value T that guarantees that the 
probability of having more than x false opposite homozygous calls 
is bounded by some pre-defined low probability. In the database 
processing, we count the number of the observed opposite 
homozygous SNPs in each indexed individual, and eliminate the 
ones with more than z observed opposite homozygous SNPs, as 
these individuals are deemed as likely to have true opposite 
homozygous calls with the query. 

Consider the Major Filter first. Given the query individual, we 
are able to predict the probability of having a false opposite 
homozygous call at a certain position with respect to an indexed 
individual. By definition, an opposite homozygous SNP is 
observed at a certain position if the following two statements are 
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Figure 5. The performance of SpeeDB at various simulated error rates. Queries for the WTCCC data set were simulated with various error 
rates 0 and the speedup observed remained within an order of magnitude even for high error rates. 
doi:1 0.1 371 /journal.pone.009271 3.g005 



both true: (i) the marker was selected during marker filter 
construction, and (ii) the indexed individual has an observed 
major allele count of 2 at the same marker. Let Gf and G\'° 
correspond to the observed minor allele count at position i for the 
query individual q and an indexed individual t, respectively. 
Similarly, let Gf r and G\' correspond to the true underlying 
minor allele count at position i for the query individual q and the 



indexed individual /, respectively. The above two statements can 
be represented as (i) Gf = 2 and (ii) G''° = 0. However, the 
underlying true minor allele count of the query could be either 0, 
1, or 2, regardless of the observed homozygous minor allele 
measurement. The true genotype of an indexed individual could 
be different from the observation as well. These three cases are 
discussed below. 
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Figure 6. The influence of parameter par step . SpeeDB was applied on the WTCCC data set with various step sizes. While lower step sizes results in 
higher accuracy, it comes at the cost of more computational time. 
doi:1 0.1 371 /journal.pone.009271 3.g006 
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Table 3. The contributions of the Major Filter and the Minor 
Filter to the overall performance. 





Speedup 






Sensitivity 


Both filters 


Major Filter only 


Minor Filter only 


99% 


54.3x 


10.4X 


9.3x 


98% 


74.6x 


13.7x 


10.9x 


95% 


113.2X 


18.6x 


17.4x 



Each filter has approximately the same pruning power and using both together 
results in the best performance. 
doi:1 0.1 371 /journal.pone.009271 3.t003 



In the first case, the query truly has two copies of the minor 
allele, i.e., Gf = 2. In this case, the genomic segment of the 
indexed individual should not be filtered out if it has at least one 
copy of minor allele. The probability of an indexed individual 
having a false opposite homozygous SNP in a selected marker 
where the query is truly homozygous minor is given by 

Pr{Gf =2,G? '=x\Gf = 0,Gf = 2}. (1) 

XG{1,2] 

In the second case, the query is heterozygous, namely, Gf = 1 . 
In this case, all the opposite homozygous SNPs at this marker will 
falsely generate a rejection. The probability for the event is given 
by 

Pr{Gf = l,Gf '=x\G? =0,Gf° = 2}. (2) 

,ve{0,1,2} 



Finally, in the last case, the query has two copies of major allele 
(i.e., Gf r = 0). The genomic segments of indexed individuals with 
at least one copy of major allele should not be removed from 
candidate list. The probability for this event is given by 

Pr{Gf r = 0,G'f r = x\G'f° =0,Gf° = 2}. (3) 

-ve{0,l} 

Summing up Eq. (l)-(3) gives us the probability of having a false 
opposite homozygous SNP between an indexed individual and the 
query in a certain marker position, conditioned on the fact that the 
marker meets the filter selection criteria. Assuming the query 
genotypes are independent, for any given marker i, the probably of 
having a false opposite homozygous call due to genotyping error is 
given by the following summation: 

P»i= E E Pr{Gr=y,Gy=x\Gf° = 0,G?-° = 2} (4) 

je{0,l,2}xex(j) 

where, 



r {o,i}, >>=o 

x(y)=l {0,1,2}, y=\ (5) 
[{1,2}, y = 2 

The conditional probabilities of the observed genotypes given the 
true underlying genotypes are given in Table 1 . Substituting these 
probabilities into Eq. (4) formulates p„,j as a function of the 
genotyping error rate 8 and the minor allele frequency p. 

The approximation of f(u) is specified below. We use a fixed 
value for the minor allele frequency p to represent all markers and 
compute p w by substituting this value into Eq. (4). Then, assuming 




Figure 7. Performance vs. marker density. Markers were sub-sampled from the HapMap data set to varying degrees and speedup was measured 
with at a 99% sensitivity level. The densities corresponding to specific data sets used in other experiments are labeled. 
doi:1 0.1 371 /journal.pone.009271 3.g007 



PLOS ONE | www.plosone.org 



6 



March 2014 | Volume 9 | Issue 3 | e92713 



SpeeDB: An Effective IBD Filter 



100000X; 



10000X 



1000x 




100X* 



0.95 



0.9 
Sensitivity 









4th cousins 
-*-5th cousins 

6th cousins 
— ■ — 7th cousins 





















0.85 



0.8 



Figure 8. Performance of SpeeDB for identifying candidate close relatives. Performance was measured on query individuals that were 



simulated to have one IBD segment in size equal to — — 
doi:10.1371/journal.pone.0092713.g008 2-(d+l) 



cM where d is the degree of cousin. 



v markers are selected in the current window, we estimate the 
probability that an indexed individual has u genotyping-error- 

(y 

induced false opposite homozygous SNPs by f{u) = I 

(PnT '(^ —fwY " ■ Thus, to account for x erroneous opposite 
homozygous SNPs, the probability that a genomic segment of an 
indexed individual being IBD with the query and will not pass the 
filter due to genotyping errors is estimated to be 1 — ~}2l = of(u). 
To bound it by a pre-defined threshold p t h, we find the minimum 
value of T that guarantees that 



(6) 



The value t is positively correlated with the marker count V and 
is negatively correlated with the minor allele frequency p. We 
conservatively set the fixed minor allele frequency p to be its 
minimum value observed in the window. By doing so, the filter 
maximizes tolerance with respect to false opposite homozygous 
SNPs. 

When considering the Minor Filter, Gf = 0 and Gf" = 1 are 
considered as the opposite homozygous SNPs. Following similar 
argument as above, the conditional probability of having a false 
opposite homozygous SNPs at a certain marker location given the 
marker is selected for constructing the Minor Filter can be written as 
follows 

Pl,i= E E Pl '{ G '''' = ?' G 'i' = A 'l G 'i" = 2 ' G i'° = °} ( 7 ) 
><e{0,l,2} xex(y) 

where x(y) is as same as Eq. (5). 



Similar to the analysis of the Major Filter, assuming v* markers 
are selected, we need to find the minimum T* to guarantee 



(8) 



where, p* w is computed with a fixed p value. For this filter, we set 
the fixed p value to be its maximum value to be conservative. 

Improving Pruning Efficiency 

As we perform that analysis on a dense set of bi-allelic 
markers, high levels of linkage disequilibrium are expected 
between neighboring SNPs. The non-random association 
between alleles can in turn result in high concordance between 
h representing neighboring positions. Indeed, our analysis 
reveals that for neighboring markers present such high 
concordance. We can thus select only a subset of the markers; 
a single marker with set h can represent other markers with the 
similar h's which will otherwise add a non-significant pruning 
power. Based on this observation, we use the following Jaccard 
index criteria: 



|h,-nh;| 
|h,-Uh/| 



>1j 



(9) 



where h, and hj represent indexed individual sets for markers i 
and j, respectively. During the indexing phase, two markers with 
a Jaccard index higher than r\j will be marked as representing 
approximately the same set using a unique label. During the 
query phase, when constructing the appropriate filter, only 
markers with unique labels are selected. Thus, the markers that 
will presumably add little pruning power are discarded. 
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Results 

Experimental Setup 

To evaluate the performance of SpeeDB, we used data from 
255 individuals from Asian populations of the HapMap Phase III 
panel [14], and from 1,480 individuals from the 1958 Birth 
Cohort of the Wellcome Trust Case Control Consortium 
(WTCCC) [15]. Our analysis focused on the long arm of 
chromosome one which contained 36,638 and 13,943 SNPs for 
HapMap and WTCCC, respectively. We defined a set of indexed 
individuals from each data set using the genotypes of all 
individuals in the data set, which we call HapMap and WTCCC. 
We generated two additional data sets, HapMap. lowD and 
HapMap. 1M, from HapMap by using all HapMap individuals 
but downsampling the markers used. HapMap.lowDwss sampled to 
have the same marker density as WTCCC, and HapMap. 1M was 
sampled to have a marker density corresponding to 1 million 
markers across the entire genome. We also combined the data 
from 14,318 individuals across all of the disease and control 
cohorts of the WTCCC study into a data set which we call 
WTCCC.largeSet. Since haplotypes were necessary for our simula- 
tions, we used only trio-based phased data from HapMap and we 
phased the WTCCC data using HAPI-UR [16]. 

For each data set, we simulated query individuals that each 
contained one shared 4 cM IBD segment (except where otherwise 
specified) with exactly one haplotype of one individual in the data 
set. To do this, for each query, we first created two composite 
haplotypes using the original haplotypes from the data sets in such 
a way to break up any latent IBD segments that may exist between 
individuals in the data set in a way that is similar to the approaches 
taken in previous work [12,13]. Specifically, we first partitioned 
the genome into a series of 0.2 cM segments. Each segment of a 
composite haplotype was made by copying alleles from a randomly 
selected original haplotype as illustrated in Fig. SI A. Next, we 
simulated a 4 cM IBD segment between the query individual and 
a random indexed individual at a random location by copying one 
of the haplotypes of the indexed individuals over one of the 
haplotypes of the query individual in the segment boundaries (see 
Fig. SIB). Finally, we introduced errors in the query individual's 
haplotypes with an error rate 6. For each experiment, we 
generated a set of 10,000 queries. 

Except where otherwise noted, the filter parameters were set as 
follows: 0 = 0.005 [12]; the window size par vm< i ov = 3.5 cM and 
the step size par^p = 0.5 cM (appropriate for identifying 4 cM 
IBD segments); and the Jaccard index parameter rjj = 0.9. All 
experiments were performed on an AMD Opteron 6172 
Processor. 

Experimental Results 

We performed a number of simulations to demonstrate the 
efficiency and performance of SpeeDB on various data sets. To 
measure efficiency, we report the average running time for filtering a 
database based on a given query. To measure performance, we 
report the sensitivity as the fraction of IBD segments that pass 
SpeeDB's filters and the speedup resulting from filtering out large 
portions of the database. The speedup is calculated as \ / (candidate 
fraction) where the candidate fraction is defined as the fraction of the 
database that passes the filters, where the database size is defined 
as the genetic length times the number of individuals in the 
database. That is, the candidate fraction represents the fraction of 
database that SpeeDB determines may have IBD segments with 
the query, which will then be passed on to a downstream accurate 
IBD detection method. 



In the HapMap data set, SpeeDB was able to reduce the total 
amount of work required for downstream IBD detection by 99.5% 
while retaining 99% of the true IBD segments in the output and 
only took 3.2 ms per query to run. This effectively means that IBD 
can be detected 200 times faster while only reducing sensitivity by 
at most 1%. 

We applied SpeeDB in conjunction with a state-of-the-art IBD 
inference method PARENTE and theoretically estimated the total 
running time necessary to run SpeeDB and then run PARENTE 
on the portion of the data that passed SpeeDB's filtering. Our 
results in Table 2 show that SpeeDB's running time is negligible 
compared to PARENTE, and that running PARENTE on the 
output of SpeeDB requires significantly less time than running 
PARENTE on the entire data set. We also compared the 
conjunction of SpeeDB and PARENTE's running time to 
GERMLINE and fastlBD; the results are shown in the same 
table. Note that, SpeeDB and PARENTE are designed to operate 
on unphased genotype data. While GERMLINE can also run on 
unphased data, its accuracy in this mode is significantly lower, 
making it impractical to use on large data sets. We therefore ran 
GERMLINE on phased data, which consequentiy requires a 
phasing pipeline to be run prior to IBD inference. The running 
time in Table 2 includes this overhead. 

The results of running SpeeDB on each data set are shown in 
Fig. 3. SpeeDB's performance, measured by the speedup achieved 
by using the filter (assuming negligible filter running time) while 
retaining a particular sensitivity level, are shown in Fig. 3A. Each 
performance curve was generated by varying the error tolerance 
thresholds p t h and p* lh in Eq. (6) and Eq. (8). Because we had two 
thresholds to vary, we could not describe the performance of 
SpeeDB with a single ROC curve. Instead, it could be described 
with many potentially-overlapping ROC curves, each produced by 
fixing one threshold to a particular value and varying the other 
(see Fig. S2). For clarity, each performance curve in Fig. 3 shows 
the Pareto frontier (i.e. the leftmost points) of all the ROC curves 
generated for each data set. Because we show this frontier, there 
are a few bumps in the curves resulting in where the individual 
ROC curves intersect (see Fig. S2 to view the individual ROC 
curves for the HapMap data set shown here). It is interesting to note 
that the vast majority of the results with / , th// , th = 2.5 to 5 fall on 
or very close to the Pareto frontier in all the data sets evaluated. 

The results in Fig. 3A show that SpeeDB filters out a large 
fraction of the database while maintaining a high level of 
sensitivity. For example, for the HapMap data set, it can attain a 
99% sensitivity level while increasing the speed of IBD detection 
200-fold by filtering out the 99.5% of the database as possible IBD 
locations. This figure also shows that SpeeDB performs better with 
higher marker density. We first observed that SpeeDB performs 
much better on HapMap compared to the WTCCC data sets and 
we also noted HapMap has higher marker density than the 
WTCCC data sets, so we suspected that this may be the reason. 
To confirm that marker density was a major contributing factor to 
better performance, we downsampled markers in HapMap to 
create HapMap. lowD to have the same the marker density as seen 
in WTCCC; 22,695 markers were removed randomly in this 
process. We then observed that the performance of HapMap. lowD 
is very similar to the performance of WTCCC and WTCCC.largeSet, 
indicating that marker density is a key factor affecting perfor- 
mance. Despite the lower density of WTCCC, SpeeDB still 
achieves a 54. 3x speedup at 99% sensitivity on the WTCCC data 
set. 

The efficiency of SpeeDB is illustrated in Fig. 3B and Table 2. 
These results show that SpeeDB only requires a few milliseconds 
to analyze each query. 



PLOS ONE | www.plosone.org 



8 



March 2014 | Volume 9 | Issue 3 | e92713 



SpeeDB: An Effective IBD Filter 



To examine the scalability of SpeeDB, we created smaller 
databases by using subsets of individuals in the original databases. 
We observed that querying HapMap requires more time than 
querying WTCCC with the same number of individuals. We 
attributed it to the difference in marker density of these two 
databases. To illustrate this point, we used the HapMap.lowD data 
set. As shown in Fig. 3B, the running time for querying 
HapMap.lowD and WTCCC is almost identical. We also show that 
the running time is linear to the database size, implying that this 
tool can be utilized on even larger data sets. Its applicability to the 
large WTCCC data set with 14,318 individuals is demonstrated in 
Fig. S3. Assuming the marker density is as same as WTCCC, it is 
estimated that querying a database with 1 million individuals for 
IBD on the long arm of chromosome 1 takes roughly 4 seconds; 
assuming the marker density of HapMap, the estimated time is 
roughly 12 seconds. 

The scalability of SpeeDB in terms of pruning power is 
demonstrated in Fig. 4. Again, we created smaller databases by 
using subsets of individuals in WTCCC, and applied SpeeDB on 
these databases. Fig. 4 shows the candidate fraction as a function 
of database size at three sensitivity levels. For 99% sensitivity, the 
candidate fraction appears to be independent of database size, 
implying that the applicability of SpeeDB on large databases is 
promising. 

We also investigated how often SpeeDB's output contains the 
entire true IBD segment such that no part of the true IBD segment 
lies outside of the candidate regions. To do this, we computed the 
performance with the more stringent requirement that an IBD 
segment must be entirely covered by candidate regions to be 
considered correct, otherwise the part of the candidate region that 
overlaps the IBD segment are considered incorrect. In our 
experiments, we found that in the vast majority of cases, SpeeDB's 
output correctly captures the entire true IBD region. For example, 
in the HapMap and WTCCC data sets, when SpeeDB identifies 
99% of all markers in IBD (overall sensitivity), SpeeDB detects 
98.76% and 98.65% of all IBD segments to be fully covered by 
candidate regions, respectively. 

Discussion 

In order to examine SpeeDB's resistance to noise in the 
genotyping process, we ran several experiments varying the rate of 
simulated genotyping errors while fixing the modeled error rate 
0 = 0.005 (therefore, we used the same /?th and pl h ). The results of 
this experiment are shown in Fig. 5 which demonstrates SpeeDB's 
robustness to genotyping errors. At the 99% sensitivity level, when 
varying the error rate from 0 = 0.0025 (a realistic error rate for 
modern genotyping arrays) to 0 = 0.01, the speedup achieved 
remains within an order of magnitude, between 25x and 84x. 

The choice of step size also influences the performance of 
SpeeDB; a larger step size results less computation but reduced 
effectiveness. This is because at a particular window, SpeeDB 
reports on the potential IBD status the window itself plus the 
par step cM flanking both sides of the window. Because par wm d ow 
was set to the injected IBD segment size minus par step , increasing 
the step size reduces the expected number of markers that will be 
used to perform filtering, resulting in reduced effectiveness. 
However, a larger step size means fewer windows will be visited 
overall, resulting in fewer computations. We ran SpeeDB with 
various step sizes in order to measure its accuracy as a function of 
step size and the results are shown in Fig. 6. With the step size 
increasing from 0.25 cM to 0.5 cM to 1 cM, the running time per 
query decreases from to 7.6 ms to 5.9 ms to 5.5 ms, but the 



speedup observed at 99% sensitivity decreases from 55. lx to 54.3x 
to 40.2x. 

SpeeDB is composed of two filters. When the minor allele 
frequency is very low, we noticed that the Major Filter is very 
sensitive to the genotyping errors. With this observation, we 
eliminated low MAF (p<0.2) markers from this filter. Table 3 
shows that each filter by itself has comparable pruning power, 
however the combination of the two yields significantly higher 
performance. Additionally, the running time on these two filters 
are roughly equal. 

We introduced an optimization technique in the Methods 
section to ignore markers that provide littie pruning power for a 
given query. Intuitively, when most of the indexed individuals 
have the same genotypes at two markers, using both markers for 
filtering provides little additional filtering benefit over using just 
one of them. Therefore, we applied the Jaccard index criteria to 
label groups of markers with genotypes that are similar to one 
another so that only one SNP from each group is used during 
filtering calculations. For HapMap, which had 36,638 markers, this 
resulted in 9,831 and 12,997 distinct labels for the Major Filter and 
the Minor Filter, respectively. For WTCCC, which had 13,943 
markers, this resulted in 5,500 and 8,366 distinct labels for the 
Major Filter and the Minor Filter, respectively. This optimization 
reduces the considerable number of selected homozygous markers 
in filters. As a result, the running time for processing a query is 
reduced by half at the expense of higher candidate size. By using 
the optimization technique, the candidate fractions at the 
sensitivity of 99% and 98% are 20.1% and 11.1% higher than 
not using it. This technique was used in all experiments to expedite 
the computation; however using it remains optional as the 
difference in candidate size could be important in larger data sets. 

Marker density has a strong effect on the performance of 
SpeeDB. We further explored their correlation by uniformly sub- 
sampling the markers in HapMap to various marker densities. The 
results shown in Fig. 7 are measured at a 99% sensitivity level. We 
observe that the speedup achieved by SpeeDB increases super- 
linearly with marker density, implying that SpeeDB can provide 
better pruning power if applied to data sets with high marker 
density. This promising result implies SpeeDB's increasing 
relevance as researchers transition to using whole-genome 
sequencing instead of genotyping arrays. 

To investigate the utility of SpeeDB on identifying IBD between 
close relatives, which is important in hereditary disease studies, we 
simulated cousins of four different degrees based on the data in 
WTCCC. For each experiment, 10,000 queries were generated 
with each query simulated as a d — th cousin of an indexed 

100 

individual by injecting one IBD segment with length L= ^jZjTTj 

cM. par^ was set to 0.5 cM; par wm d ov , was set to the injected 
IBD segment size minus par step , round up to closet half cM. The 
results on fourth to seventh cousins are presented in Fig. 8. Our 
results show that SpeeDB offers a very high pruning power when it 
is applied to detect close relatives, compared to the performance 
for detecting distant relatives sharing a 4 cM IBD segment. In 
particular, for fourth cousins as an example, SpeeDB achieves a 
10,000x speedup with 99% sensitivity. Due to the fact that the 
number of selected homozygous markers within a window depends 
strongly on the window size, SpeeDB gradually loses its pruning 
power for more distant relatives. Nevertheless, even for seventh 
cousins, SpeeDB achieves a 86 7x speedup at a 99% sensitivity 
level. 

SpeeDB is designed to provide a list of regions for each pair to 
downstream analysis methods. The algorithms used in GERM- 
LINE, PARENTE, and fastlBD are amenable to using these 
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regions to narrow the portion of the genome analyzed for each 
pair of individuals because they consider only local genomic 
information for IBD inference. This is on contrast to methods that 
consider the entire chromosome for each pair as is done with 
traditional HMM methods. The next version of PARENTE is 
being implemented to allow these regions to be specified and the 
implementations of GERMLINE and fastlBD can be modified in 
order to take these regions as input. 

As variants detected through NGS data become increasingly 
available, the requirement for scalable analysis methods becomes 
clearer, and tools such as SpeeDB become an essential enabler for 
such large-scale studies. In particular, assuming millions of variants 
per samples and tens of millions unique variants across all 
individuals stored [17], we can extrapolate on the expected 
performance. For instance, when analyzing a database of 100,000 
individuals, assuming 40 million genome-wide variants, and 
roughly 10% of them are non-reference in a query individual, if 
we index all the variants, the analysis would take a few hours. In 
reality, a large portion of variants have very low frequencies 
(<0.5%). Because hardly any sample is homozygous minor at 
these markers, these data have almost no contribution to the 
pruning process. We can easily eliminate them from our analysis 
without sacrificing the performance. By doing so, we are able to 
improve pruning efficiency markedly. 

Conclusion 

As studies grow to include larger cohorts, state-of-the-art IBD 
detection algorithms become computationally infeasible. To 
enable impending disease mapping studies of increasingly large 
cohorts, a scalable infrastructure is required. This infrastructure 
must be able to further support the ongoing nature of such studies 
by providing the ability to incorporate newly collected samples. 
We implemented SpeeDB to provide a practical and scalable 
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